Automatic Web Page Classification
Abstract
The aim of this paper is to describe a method for automatically classifying web pages into semantic domains, together with its evaluation. The classification method exploits machine learning algorithms and several morphological and semantic text processing tools. In contrast to general text document classification, web document classification often has to deal with short web pages. In this paper we propose two approaches to compensate for this lack of information. In the first, we consider the wider context of a web page: we analyze the web pages referenced from the investigated page. The second approach is based on sophisticated clustering of terms by their similar grammatical context, performed with the Sketch Engine, a statistical corpus tool.
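The first approach, widening the context of a short page with the pages it references, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the token threshold `min_tokens`, and the down-weighting factor `link_weight` are all assumptions made for illustration, and the abstract does not say how linked text is actually weighted.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z]+")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return TOKEN_RE.findall(text.lower())


def page_features(page_text, linked_texts, min_tokens=50, link_weight=0.5):
    """Return weighted term counts for a page.

    If the page itself has fewer than `min_tokens` tokens, terms from the
    pages it links to are added with a reduced weight (`link_weight`).
    Both parameters are hypothetical values, not taken from the paper.
    """
    own_tokens = tokenize(page_text)
    features = Counter()
    for token in own_tokens:
        features[token] += 1.0

    # Only short pages borrow context from the pages they reference.
    if len(own_tokens) < min_tokens:
        for linked in linked_texts:
            for token in tokenize(linked):
                features[token] += link_weight
    return features


short_page = "Football scores"
linked_pages = ["Football league table", "Match results"]
features = page_features(short_page, linked_pages, min_tokens=5)
```

The resulting term counts could then be passed to any standard text classifier; terms from linked pages count for less than the page's own terms, so the page's own content still dominates when it is available.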
Similar resources
A Novel Approach to Feature Selection Using PageRank algorithm for Web Page Classification
In this paper, a novel filter-based approach is proposed using the PageRank algorithm to select the optimal subset of features as well as to compute their weights for web page classification. To evaluate the proposed approach multiple experiments are performed using accuracy score as the main criterion on four different datasets, namely WebKB, Reuters-R8, Reuters-R52, and 20NewsGroups. By analy...
Resource Optimization in Automatic web page classification using integrated feature selection and machine learning
As the number of users increases, so does the need for automatic classification techniques with good classification accuracy, because search engines depend on previously classified web pages stored in classified directories to retrieve relevant results. Preprocessing is an important step in the web page classification problem, as most web pages contain more irrelevant information than rel...
Automatic Web Page Classification
To facilitate user browsing of the Web, some websites such as Yahoo! (http://dir.yahoo.com) and the Open Directory Project (http://dmoz.org) manually maintain a hierarchical structure. While manual classification of web pages provides high accuracy, it is very expensive. To automatically include newly emerging pages in these hierarchies, web page classification has become a hot research topic in web infor...
An n-gram Based Approach to the Classification of Web Pages by Genre
The extraordinary growth in both the size and popularity of the World Wide Web has created a growing interest not only in identifying Web page genres, but also in using these genres to classify Web pages. The hypothesis of this research is that an n-gram representation of a Web page can be used effectively to automatically classify that Web page by genre. This research involves the development ...
Automatic Web Page Categorization by Link and Context Analysis
Assistance in retrieving documents on the World Wide Web is provided either by search engines, through keyword-based queries, or by catalogues, which organize documents into hierarchical collections. Maintaining catalogues manually is becoming increasingly difficult, due to the sheer amount of material on the Web; it is thus becoming necessary to resort to techniques for the automatic classific...